nlp_architect.data.babi_dialog.BABI_Dialog

class nlp_architect.data.babi_dialog.BABI_Dialog(path='.', task=1, oov=False, use_match_type=False, use_time=True, use_speaker_tag=True, cache_match_type=False, cache_vectorized=False)[source]

This class loads the Facebook bAbI goal-oriented dialog dataset and vectorizes it into user utterances, bot utterances, and answers.

As described in: “Learning End-to-End Goal-Oriented Dialog”. https://arxiv.org/abs/1605.07683.

For a particular task, the class will read both train and test files and combine the vocabulary.

Parameters
  • path (str) – Directory to store the dataset

  • task (int) – A particular task to solve (all bAbI tasks are trained and tested separately)

  • oov (bool, optional) – Load test set with out of vocabulary entity words

  • use_match_type (bool, optional) – Flag to use match-type features

  • use_time (bool, optional) – Add time words to each memory, encoding when the memory was formed

  • use_speaker_tag (bool, optional) – Add speaker words to each memory (<BOT> or <USER>) indicating who spoke each memory.

  • cache_match_type (bool, optional) – Flag to save match-type features after processing

  • cache_vectorized (bool, optional) – Flag to save all vectorized data after processing

data_dict

Dictionary containing final vectorized train, val, and test datasets

Type

dict

cands

Vectorized array of potential candidate answers, encoded as integers, as returned by the BABI_Dialog class. Shape = [num_cands, max_cand_len]

Type

np.array

num_cands

Number of potential candidate answers.

Type

int

max_cand_len

Maximum length of a candidate answer sentence in number of words.

Type

int

memory_size

Maximum number of sentences to keep in memory at any given time.

Type

int

max_utt_len

Maximum length of any given sentence / user utterance

Type

int

vocab_size

Number of unique words in the vocabulary + 2 (0 is reserved for a padding symbol, and 1 is reserved for OOV)

Type

int
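
The reserved indices above can be illustrated with a minimal sketch. This is an assumed, illustrative implementation of the indexing scheme (0 for padding, 1 for OOV), not the library's actual code; `build_word_index` is a hypothetical helper.

```python
# Illustrative sketch of the vocabulary indexing scheme described above:
# index 0 is reserved for padding and index 1 for out-of-vocabulary words,
# so real vocabulary words start at index 2.
def build_word_index(vocab):
    """Map each unique word to an integer index, reserving 0 and 1."""
    return {word: idx + 2 for idx, word in enumerate(sorted(vocab))}

word_index = build_word_index({"cheap", "italian", "restaurant"})
vocab_size = len(word_index) + 2  # vocabulary words + padding (0) + OOV (1)
```

With three vocabulary words, `vocab_size` is 5, matching the "+ 2" convention stated above.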

use_match_type

Flag to use match-type features

Type

bool, optional

kb_ents_to_type

For use with match-type features, dictionary of entities found in the dataset mapping to their associated match-type

Type

dict, optional

kb_ents_to_cand_idxs

For use with match-type features, dictionary mapping from each entity in the knowledge base to the set of indices in the candidate_answers array that contain that entity.

Type

dict, optional

match_type_idxs

For use with match-type features, dictionary mapping from match-type to the associated fixed index of the candidate vector which indicates this match type.

Type

dict, optional

__init__(path='.', task=1, oov=False, use_match_type=False, use_time=True, use_speaker_tag=True, cache_match_type=False, cache_vectorized=False)[source]

Initialize self. See help(type(self)) for accurate signature.

Methods

__init__([path, task, oov, use_match_type, …])

Initialize self.

clean_cands(cand)

Remove leading line number and final newline from candidate answer

compute_statistics()

Compute vocab, word index, and max length of stories and queries.

create_cands_mat(data_split, cache_match_type)

Add match type features to candidate answers for each example in the dataset.

create_match_maps()

Create dictionary mapping from each entity in the knowledge base to the set of indices in the candidate_answers array that contain that entity.

encode_match_feats()

Replace entity names and match type names with indexes

get_vocab(dialog)

Compute vocabulary from the set of dialogs.

load_candidate_answers()

Load candidate answers from file, compute number, and store for final softmax

load_data()

Fetch and extract the Facebook bAbI-dialog dataset if not already downloaded.

load_kb()

Load knowledge base from file, parse into entities and types

one_hot_vector(answer)

Create one-hot representation of an answer.

parse_dialog(fn[, use_time, use_speaker_tag])

Given a dialog file, parse into user and bot utterances, adding time and speaker tags.

process_interactive(line_in, context, …)

Parse a given user’s input into the same format as training, build the memory from the given context and previous response, update the context.

vectorize_cands(data)

Convert candidate answer word data into vectors.

vectorize_stories(data)

Convert (memory, user_utt, answer) word data into vectors.

words_to_vector(words)

Convert a list of words into vector form.

static clean_cands(cand)[source]

Remove leading line number and final newline from candidate answer

compute_statistics()[source]

Compute vocab, word index, and max length of stories and queries.

create_cands_mat(data_split, cache_match_type)[source]

Add match type features to candidate answers for each example in the dataset. Caches once complete.

create_match_maps()[source]

Create dictionary mapping from each entity in the knowledge base to the set of indices in the candidate_answers array that contain that entity. Will be used for quickly adding the match type features to the candidate answers during fprop.

encode_match_feats()[source]

Replace entity names and match type names with indexes

get_vocab(dialog)[source]

Compute vocabulary from the set of dialogs.

load_candidate_answers()[source]

Load candidate answers from file, compute number, and store for final softmax

load_data()[source]

Fetch and extract the Facebook bAbI-dialog dataset if not already downloaded.

Returns

Tuple of training and test filenames.

Return type

tuple

load_kb()[source]

Load knowledge base from file, parse into entities and types

one_hot_vector(answer)[source]

Create one-hot representation of an answer.

Parameters

answer (string) – The word answer.

Returns

One-hot representation of answer.

Return type

list
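
As a sketch of what the one-hot representation over the candidate set looks like: `one_hot` and the sample candidates below are hypothetical illustrations, not the library's API.

```python
# Illustrative one-hot encoding of an answer over the candidate answers:
# a 1 at the matching candidate's position, 0 everywhere else.
def one_hot(answer, candidates):
    """Return a list with a 1 at the answer's position in candidates."""
    return [1 if cand == answer else 0 for cand in candidates]

vec = one_hot("api_call italian paris", ["hello", "api_call italian paris", "bye"])
```

The resulting vector has exactly one nonzero entry, suitable as a target for a final softmax over candidates.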

static parse_dialog(fn, use_time=True, use_speaker_tag=True)[source]

Given a dialog file, parse into user and bot utterances, adding time and speaker tags.

Parameters
  • fn (str) – Filename to parse

  • use_time (bool, optional) – Flag to append ‘time-words’ to the end of each utterance

  • use_speaker_tag (bool, optional) – Flag to append tags specifying the speaker to each utterance.

process_interactive(line_in, context, response, db_results, time_feat)[source]

Parse a given user’s input into the same format as training, build the memory from the given context and previous response, update the context.

vectorize_cands(data)[source]

Convert candidate answer word data into vectors.

If sentence length < max_cand_len, it is padded with 0’s

Parameters

data (list of lists) – list of candidate answers split into words

Returns

padded numpy array of word indexes for all candidate answers

Return type

tuple (2D numpy array)
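
A minimal sketch of the padding scheme described above, assuming the conventions stated for this class (0 = padding, 1 = OOV); `pad_candidates` is a hypothetical helper, not the library's implementation.

```python
# Sketch: convert tokenized candidate answers to fixed-length integer rows,
# mapping unknown words to the OOV index (1) and right-padding with 0s.
def pad_candidates(cands, word_index, max_cand_len):
    """Convert tokenized candidates to rows of length max_cand_len."""
    rows = []
    for words in cands:
        ids = [word_index.get(w, 1) for w in words]   # 1 = OOV index
        rows.append(ids + [0] * (max_cand_len - len(ids)))  # 0 = padding
    return rows

rows = pad_candidates([["cheap", "food"], ["bye"]],
                      {"cheap": 2, "food": 3, "bye": 4},
                      max_cand_len=3)
```

Every row then has shape `max_cand_len`, giving the `[num_cands, max_cand_len]` array described for `cands`.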

vectorize_stories(data)[source]

Convert (memory, user_utt, answer) word data into vectors.

If sentence length < max_utt_len, it is padded with 0’s. If memory length < memory_size, it is padded with empty memories (max_utt_len 0’s).

Parameters

data (tuple) – Tuple of memories, user_utt, answer word data.

Returns

Tuple of memories, memory_lengths, user_utt, answer vectors.

Return type

tuple
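
The two-level padding described above can be sketched as follows. This is an assumed illustration of the stated behavior (utterances zero-padded to max_utt_len, the memory padded with empty utterances up to memory_size); `pad_memory` is a hypothetical helper.

```python
# Sketch: pad each utterance to max_utt_len with 0s, then append empty
# utterances (all 0s) until the memory holds memory_size entries.
def pad_memory(memory, memory_size, max_utt_len):
    """Zero-pad utterances, then pad the memory with empty utterances."""
    padded = [utt + [0] * (max_utt_len - len(utt)) for utt in memory]
    padded += [[0] * max_utt_len for _ in range(memory_size - len(padded))]
    return padded

mem = pad_memory([[2, 3]], memory_size=3, max_utt_len=4)
```

The result is a fixed `[memory_size, max_utt_len]` grid regardless of how many utterances the dialog has produced so far.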

words_to_vector(words)[source]

Convert a list of words into vector form.

Parameters

words (list) – List of words.

Returns

Vectorized list of words.

Return type

list